Crash Report Sampling System

Introduction

The CRSS, also known as the Crash Report Sampling System, is a sample of police reported crashes involving motor vehicles, bicycles, and pedestrians. It includes accidents that have only caused damage to the vehicle to accidents that have led to fatalities. Since the early 1970s, the National Highway Traffic Safety Administration has been collecting data on these crashes to reduce the number of accidents, injuries, and deaths on America’s highways. It is used to estimate the overall crash, identify areas on highways that may not be safe, measure trends, and help create new safety initiatives. The CRSS gets data from a nationally representative probability sample from millions of police reported crashes, and concentrates on crashes that have been the greatest concern to the general public. To be eligible for the CRSS sample, the police report must have been signed by a police officer, involve at least one vehicle traveling on the highway, and must involve damage to property, injury, or death. The reports picked for the CRSS are across 60 areas in the United States that reflect the country’s geography and population. Trained CRSS coders will then take the report, import it into a data file, and then do a quality check, to ensure the data is valid. After this, the data is public for people like myself to analyze. I believe the most important part of the CRSS is what we get from the data. Using this data, people in the NHTSA are able to understand which traffic regulations and highway safety programs are effective, leading to better roads and less crashes.

For my project, I decided to see if there was any correlation between a vehicle’s make, and the likelihood of crashing in the rain. I wanted to test if cars with a certain reputation were safer than cars that are more common on our streets. I decided to test BMWs, Audis, Nissans, Toyotas, Hyundais, and Mercedes-Benz. I imported data from the CRSS for 2020-2022, and cleaned it to find only variables I needed. I used the accident dataset, the vehicles dataset, and the weather dataset. I then used a few statistical tests to find out if the correlation was strong or weak.

References

Source National Highway Traffic Safety Administration. Crash Report Sampling System. U.S. Department of Transportation, www.nhtsa.gov/crash-data-systems/crash-report-sampling-system. Accessed 22 Nov. 2024

Coding

library(dplyr) #Data wrangling tools
library(ggplot2) # Used for graphing
library(RColorBrewer) # Color Palettes for Graphs 
library(plotly) #Interactive plots
library(patchwork) #Show charts together 
library(vcd) #Cramer V's table
### Reading Data #### 
vehicles2022 <- read.csv("~/Downloads/CRSS2022CSV/vehicle.csv")
vehicles2021 <- read.csv("~/Downloads/CRSS2021CSV/vehicle.csv")
vehicles2020 <- read.csv("~/Downloads/CRSS2020CSV/vehicle.csv")
weather2022 <- read.csv("~/Downloads/CRSS2022CSV/weather.csv")
weather2021 <- read.csv("~/Downloads/CRSS2021CSV/weather.csv")
weather2020 <- read.csv("~/Downloads/CRSS2020CSV/weather.csv")
accident2020 <- read.csv("~/Downloads/CRSS2020CSV/accident.csv")
accident2021 <- read.csv("~/Downloads/CRSS2021CSV/accident.csv")
accident2022 <- read.csv("~/Downloads/CRSS2022CSV/accident.csv")


# 2020 data cleaning ####

cleanweather2020 <- weather2020 %>% 
  select(CASENUM, WEATHERNAME) %>% #Selecting columns I need
  rename(casenum = CASENUM, weather = WEATHERNAME) %>% #Simplifying column names 
  mutate(weather = recode(weather, "Not Reported" = "NR", "Freezing Rain or Drizzle" = "Rain", "Reported as Unknown" = "NR", 
                          "Fog, Smog, Smoke" = "Smoke", "Severe Crosswinds" = "Windy", 
                          "Blowing Snow" = "Snow")) #Making names smaller for cleaner graph 

 

# 2021 data cleaning ####
cleanweather2021 <- weather2021 %>% 
  select(CASENUM, WEATHERNAME) %>% #Selecting columns I need
  rename(casenum = CASENUM, weather = WEATHERNAME) %>% #Simplifying column names 
 mutate(weather = recode(weather, "Not Reported" = "NR", "Freezing Rain or Drizzle" = "Rain", "Reported as Unknown" = "NR", 
                          "Fog, Smog, Smoke" = "Smoke", "Severe Crosswinds" = "Windy", 
                          "Blowing Snow" = "Snow", "Blowing Sand, Soil, Dirt" = "Windy"))

# 2022 data cleaning ####
cleanweather2022 <- weather2022 %>% 
  select(CASENUM, WEATHERNAME) %>% #Selecting columns I need
  rename(casenum = CASENUM, weather = WEATHERNAME) %>% #Simplifying column names 
  mutate(weather = recode(weather, "Not Reported" = "NR", "Freezing Rain or Drizzle" = "Rain", "Reported as Unknown" = "NR", 
                          "Fog, Smog, Smoke" = "Smoke", "Severe Crosswinds" = "Windy", 
                          "Blowing Snow" = "Snow", "Blowing Sand, Soil, Dirt" = "Windy"))


#  vehicles data cleaning #### 


vhmake2020 <- vehicles2020 %>% mutate(Year = 2020) %>% select(Year, CASENUM, MAKENAME) %>% rename(casenum = CASENUM, make = MAKENAME) %>% filter(make %in% c("Toyota", "Hyundai", "Nissan/Datsun", "BMW", "Audi", "Mercedes-Benz"))
vhmakewthr2020 <- left_join(vhmake2020, cleanweather2020, by = "casenum") 
## Warning in left_join(vhmake2020, cleanweather2020, by = "casenum"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 208 of `x` matches multiple rows in `y`.
## ℹ Row 13 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
vhmake2021 <- vehicles2021 %>% mutate(Year = 2021) %>% select(Year, CASENUM, MAKENAME) %>% rename(casenum = CASENUM, make = MAKENAME) %>% filter(make %in% c("Toyota", "Hyundai", "Nissan/Datsun", "BMW", "Audi", "Mercedes-Benz"))
vhmakewthr2021 <- left_join(vhmake2021, cleanweather2021, by = "casenum") 
## Warning in left_join(vhmake2021, cleanweather2021, by = "casenum"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 464 of `x` matches multiple rows in `y`.
## ℹ Row 4 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
vhmake2022 <- vehicles2022 %>% mutate(Year = 2022) %>% select(Year, CASENUM, MAKENAME) %>% rename(casenum = CASENUM, make = MAKENAME) %>% filter(make %in% c("Toyota", "Hyundai", "Nissan/Datsun", "BMW", "Audi", "Mercedes-Benz"))
vhmakewthr2022 <- left_join(vhmake2022, cleanweather2022, by = "casenum") 
## Warning in left_join(vhmake2022, cleanweather2022, by = "casenum"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 10 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
vhmakewthr <- rbind(vhmakewthr2020, vhmakewthr2021, vhmakewthr2022) 

#Creating count dataset for overall data
weather_countspop1 <- vhmakewthr %>%
  count(make, weather) %>% rename("count" = "n") %>% filter(weather %in% c("Rain", "Clear", "Cloudy", "Snow"))

# Creating my sample ####
sampled_data <- vhmakewthr %>%
  group_by(make) %>%                    # Group by the 'make' column
  sample_n(1600, replace = FALSE) %>%     #only 1666 Audis in the data, so must be less than 
  ungroup()

# Creating count for each make in weather with sample data
weather_countspop2 <- sampled_data %>%
  count(make, weather) %>% rename("count" = "n") %>% filter(weather %in% c("Rain", "Clear", "Cloudy", "Snow")) %>%
  mutate(percent = count / 1600) 


#Pie Chart sample data 
rain_crashes <- sampled_data %>%
  filter(weather == "Rain") %>%
  group_by(make) %>%
  summarise(count = n())  

clear_crashes <- sampled_data %>% 
  filter(weather == "Clear") %>% 
  group_by(make) %>% 
  summarise(count = n()) 

# Creating sample data with 4 main weather conditions

vhmakewthr2 <- sampled_data %>% filter(weather %in% c("Rain", "Clear", "Cloudy", "Snow"))

Tests

Contingency Table

contingency_weather <- sampled_data %>% select(weather, make) 

contingency_weather$weather <- ifelse(contingency_weather$weather == "Rain", 1, 0)

contingency_table <- table(contingency_weather$make, contingency_weather$make)

print(contingency_table)
##                
##                 Audi  BMW Hyundai Mercedes-Benz Nissan/Datsun Toyota
##   Audi          1600    0       0             0             0      0
##   BMW              0 1600       0             0             0      0
##   Hyundai          0    0    1600             0             0      0
##   Mercedes-Benz    0    0       0          1600             0      0
##   Nissan/Datsun    0    0       0             0          1600      0
##   Toyota           0    0       0             0             0   1600

Chi-squared statistic

chisq.test(contingency_table) #There is a correlation 
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 48000, df = 25, p-value < 2.2e-16

Cramer’s V

assocstats(contingency_table) #Cramer's V 
##                    X^2 df P(> X^2)
## Likelihood Ratio 34402 25        0
## Pearson          48000 25        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.913 
## Cramer's V        : 1

Logistics Model

model <- glm(weather ~ make, family = binomial(link = "logit"), data = contingency_weather) #Logistics 
summary(model) 
## 
## Call:
## glm(formula = weather ~ make, family = binomial(link = "logit"), 
##     data = contingency_weather)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.6765     0.1019 -26.275  < 2e-16 ***
## makeBMW             0.3552     0.1344   2.643  0.00821 ** 
## makeHyundai         0.3397     0.1347   2.522  0.01169 *  
## makeMercedes-Benz   0.1909     0.1385   1.378  0.16810    
## makeNissan/Datsun   0.3475     0.1346   2.583  0.00981 ** 
## makeToyota          0.1460     0.1397   1.045  0.29601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5362.1  on 9599  degrees of freedom
## Residual deviance: 5350.5  on 9594  degrees of freedom
## AIC: 5362.5
## 
## Number of Fisher Scoring iterations: 5

Graphs

Barplots

#Barplot for total vehicle make vs weather conditions
countvhm <- ggplot(weather_countspop1, aes(x = make, y = count, text = paste("Make: ", make, "<br>Weather: ", weather, "<br>Count: ", count))) +
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  labs(
    title = "Number of Accidents per Vehicle Make Total",
    x = "Vehicle Make",
    y = "Number of Accidents"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

#Making it interactive with hover information 
plot1 <- ggplotly(countvhm, tooltip = "text")
plot1
# Stacked Bar Graph sample data vehicle make vs weather 
stacked <- ggplot(weather_countspop2, aes(x = make, y = count, fill = weather, text = paste("Make: ", make, "<br>Weather: ", weather, "<br>Count: ", count))) +
  geom_bar(stat = "identity", position = "stack") + 
  scale_fill_brewer(palette = "Pastel1") +# Use 'stat = "identity"' to specify that y is pre-calculated
  labs(title = "Rain vs Non-Rain Crashes by Car Make",
       x = "Car Make",
       y = "Count of Crashes") +
  theme_minimal()

plot2 <- ggplotly(stacked, tooltip = "text")
plot2 
# Distribution of Vehicle Makes by Rainy Condition Sample data
distrib <- ggplot(vhmakewthr2, aes(x = make, fill = make)) +
  geom_bar() +
  facet_wrap(~ weather) + 
  labs(title = "Distribution of Vehicle Makes by Rainy Condition",
       x = "Vehicle Make", y = "Number of Crashes") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Pastel1")

distrib

Pie Charts

pie_chart1 <- ggplot(rain_crashes, aes(x = "", y = count, fill = make, text = paste("Make: ", make, "<br>Count: ", count))) +
  geom_bar(stat = "identity", width = 1) +
  scale_fill_brewer(palette = "Pastel1") +
  coord_polar(theta = "y") +  # Convert bar chart to pie chart
  labs(title = "Proportion of Rain Crashes by Car Make") +
  theme_minimal() +
  theme(axis.text.x = element_blank())  # Remove x-axis labels (they are not needed for a pie chart)

pie_chart1

pie_chart2 <- ggplot(clear_crashes, aes(x = "", y = count, fill = make)) +
  geom_bar(stat = "identity", width = 1) +
  scale_fill_brewer(palette = "Pastel1") +
  coord_polar(theta = "y") +  # Convert bar chart to pie chart
  labs(title = "Proportion of Clear Crashes by Car Make") +
  theme_minimal() +
  theme(axis.text.x = element_blank())  # Remove x-axis labels (they are not needed for a pie chart)
pie_chart2

Results